{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "20_newsgroups_automl.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.5.3"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "id": "C7aGlpdRrgfu",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Copyright 2019 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     http://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xxG_Nzjqrgfy",
        "colab_type": "text"
      },
      "source": [
        "# Prepare data for *Google Cloud AutoML Natural Language* from scikit-learn"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SENX9Cxcrgfz",
        "colab_type": "text"
      },
      "source": [
        "## Overview"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MnBTX49Trgfz",
        "colab_type": "text"
      },
      "source": [
        "This notebook demonstrates how to prepare text data available in scikit-learn (or other libraries), so that it can be used in [Google Cloud AutoML Natural Language](https://cloud.google.com/natural-language/automl).\n",
        "\n",
        "The notebook reads the data into a pandas dataframe, and then makes some minor transformations to ensure that it is compatible with the AutoML Natural Language input specification. Finally, the dataframe is saved to a CSV file, which can be downloaded from the notebook server."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aO6uxkAargf0",
        "colab_type": "text"
      },
      "source": [
        "## Dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2tr1DfOiz9cb"
      },
      "source": [
        "This notebook downloads the [20 newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) using scikit-learn. This dataset contains about 18,000 posts from 20 newsgroups, and is useful for text classification. More details are available on the [20 Newsgroups home page](http://qwone.com/~jason/20Newsgroups/)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3YvJahe6rgf1",
        "colab_type": "text"
      },
      "source": [
        "## Objectives"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yYDsrG-5rgf2",
        "colab_type": "text"
      },
      "source": [
        "This notebook has three goals:\n",
        "\n",
        "1. Introduce scikit-learn datasets\n",
        "2. Explore pandas dataframe text manipulation\n",
        "3. Import data into AutoML Natural Language for text classification"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fBN89gpLrgf3",
        "colab_type": "text"
      },
      "source": [
        "## What's next?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hFUQL3Rirgf3",
        "colab_type": "text"
      },
      "source": [
        "After downloading the CSV at the end of this notebook, import the data into [Google Cloud AutoML Natural Language](https://cloud.google.com/natural-language/automl) to explore classifying text."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "K65WZ6bMz9cc"
      },
      "source": [
        "## Imports"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "OZDimb-5z9cd",
        "colab": {}
      },
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import csv\n",
        "\n",
        "from sklearn.datasets import fetch_20newsgroups"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zYxeG10oz9cg"
      },
      "source": [
        "## Fetch data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "mV7b2hHfz9ch",
        "colab": {}
      },
      "source": [
        "newsgroups = fetch_20newsgroups(subset='all')\n",
        "\n",
        "df = pd.DataFrame(newsgroups.data, columns=['text'])\n",
        "df['categories'] = [newsgroups.target_names[index] for index in newsgroups.target]\n",
        "df.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "RMJqpjZwz9cl"
      },
      "source": [
        "## Clean data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "K6yd6XSLz9cl",
        "colab": {}
      },
      "source": [
        "# Collapse each run of whitespace into a single space (regex must be enabled explicitly in recent pandas)\n",
        "df['text'] = df['text'].str.replace(r'\\s+', ' ', regex=True)\n",
        "\n",
        "# Change newsgroup titles to use underscores rather than periods (literal match, not a regex)\n",
        "df['categories'] = df['categories'].str.replace('.', '_', regex=False)\n",
        "\n",
        "# Trim leading and trailing whitespace\n",
        "df['text'] = df['text'].str.strip()\n",
        "\n",
        "# Truncate text to 131,072 characters (the 128 kB maximum field length, assuming one byte per character)\n",
        "df['text'] = df['text'].str.slice(0,131072)\n",
        "\n",
        "# Remove any rows with empty fields\n",
        "df = df.replace('', np.nan).dropna()\n",
        "\n",
        "# Drop duplicates\n",
        "df = df.drop_duplicates(subset='text')\n",
        "\n",
        "# Limit rows to maximum of 100,000\n",
        "df = df.sample(min(100000, len(df)))\n",
        "\n",
        "df.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
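    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a quick sanity check on the cleaning steps above, the sketch below asserts the limits named in that cell's comments: no empty fields, at most 131,072 characters per text, no duplicate texts, and at most 100,000 rows. `check_clean` is a hypothetical helper introduced only for illustration, and the limits are copied from this notebook rather than from the current AutoML documentation, so treat them as assumptions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "MAX_CHARS = 131072  # character limit used by the cleaning cell (assumed current)\n",
        "MAX_ROWS = 100000   # row limit used by the cleaning cell (assumed current)\n",
        "\n",
        "def check_clean(frame):\n",
        "    # Hypothetical helper: raises AssertionError on the first violated rule\n",
        "    lengths = frame['text'].str.len()\n",
        "    assert len(frame) <= MAX_ROWS, 'too many rows'\n",
        "    assert not frame.isna().any().any(), 'missing values present'\n",
        "    assert (lengths > 0).all(), 'empty text present'\n",
        "    assert (lengths <= MAX_CHARS).all(), 'text longer than the limit'\n",
        "    assert not frame['text'].duplicated().any(), 'duplicate text present'\n",
        "    return True\n",
        "\n",
        "# Tiny illustrative frame; in this notebook, call check_clean(df) instead\n",
        "example = pd.DataFrame({'text': ['first post', 'second post'],\n",
        "                        'categories': ['sci_space', 'rec_autos']})\n",
        "check_clean(example)"
      ],
      "execution_count": 0,
      "outputs": []
    },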
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "exFdr6wVz9co"
      },
      "source": [
        "## Export to CSV"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "pYUzEq-oz9cp",
        "colab": {}
      },
      "source": [
        "# Write the dataframe straight to disk, without the index column or a header row.\n",
        "# Writing the file directly also avoids the extra blank line that print() would append.\n",
        "df.to_csv(\"20-newsgroups-dataset.csv\", index=False, header=False)"
      ],
      "execution_count": 0,
      "outputs": []
    },
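    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "To confirm that the export can be read back, a sketch like the following loads the file with pandas. It assumes the file written above has no header row and two columns in the order `text`, `categories`; `read_exported` is a hypothetical helper name, and the cell does nothing if the file is missing."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import os\n",
        "import pandas as pd\n",
        "\n",
        "def read_exported(path):\n",
        "    # header=None because the export cell writes no header row\n",
        "    return pd.read_csv(path, header=None, names=['text', 'categories'])\n",
        "\n",
        "path = '20-newsgroups-dataset.csv'\n",
        "if os.path.exists(path):\n",
        "    exported = read_exported(path)\n",
        "    # Row and column counts, then the most frequent labels\n",
        "    print(exported.shape)\n",
        "    print(exported['categories'].value_counts().head())"
      ],
      "execution_count": 0,
      "outputs": []
    },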
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1kJX0DKHz9cr"
      },
      "source": [
        "You're all set! Download `20-newsgroups-dataset.csv` and import it into [Google Cloud AutoML Natural Language](https://cloud.google.com/natural-language/automl).\n",
        "\n",
        "If you are using [Google Colab](https://colab.research.google.com), you can find the file in the left sidebar:\n",
        "\n",
        "*   From the menu, select **View > Table of Contents**\n",
        "*   Navigate to the **Files** tab\n",
        "*   Select `..` and find the file in the `/content` directory\n",
        "*   Download the CSV from the file's context menu"
      ]
    }
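    ,
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "In Colab, the download can also be triggered from code. The sketch below assumes the `google.colab` package that Colab runtimes provide, and falls back to a message on other notebook servers, where that package is normally absent."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "try:\n",
        "    from google.colab import files  # only importable inside a Colab runtime\n",
        "except ImportError:\n",
        "    files = None\n",
        "\n",
        "if files is not None:\n",
        "    # Ask the browser to download the exported CSV\n",
        "    files.download('20-newsgroups-dataset.csv')\n",
        "else:\n",
        "    print('Not running in Colab; download the file from the notebook file browser instead.')"
      ],
      "execution_count": 0,
      "outputs": []
    }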
  ]
}